AITopics | hybrid data

hybrid data

FusionDP: Foundation Model-Assisted Differentially Private Learning for Partially Sensitive Features

Zeng, Linghui, Liu, Ruixuan, Sarkar, Atiquer Rahman, Jiang, Xiaoqian, Ho, Joyce C., Xiong, Li

arXiv.org Artificial Intelligence, Nov-7-2025

Ensuring the privacy of sensitive training data is crucial in privacy-preserving machine learning. However, in practical scenarios, privacy protection may be required for only a subset of features. For instance, in ICU data, demographic attributes like age and gender pose higher privacy risks due to their re-identification potential, whereas raw lab results are generally less sensitive. Traditional DP-SGD enforces privacy protection on all features in one sample, leading to excessive noise injection and significant utility degradation. We propose FusionDP, a two-step framework that enhances model utility under feature-level differential privacy. First, FusionDP leverages large foundation models to impute sensitive features given non-sensitive features, treating them as external priors that provide high-quality estimates of sensitive attributes without accessing the true values during model training. Second, we introduce a modified DP-SGD algorithm that trains models on both original and imputed features while formally preserving the privacy of the original sensitive features. We evaluate FusionDP on two modalities: a sepsis prediction task on tabular data from PhysioNet and a clinical note classification task from MIMIC-III. By comparing against privacy-preserving baselines, our results show that FusionDP significantly improves model performance while maintaining rigorous feature-level privacy, demonstrating the potential of foundation model-driven imputation to enhance the privacy-utility trade-off for various modalities.
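For orientation, the traditional DP-SGD baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the standard clip-then-noise update only, not FusionDP's modified algorithm, whose details the abstract does not give. The function names, the plain-list gradient representation, and the parameter names (`max_norm`, `noise_multiplier`) are assumptions made for this example.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_sample_grads, lr, max_norm, noise_multiplier, rng):
    """One standard DP-SGD update (hypothetical helper for illustration).

    Each per-sample gradient is clipped, the clipped gradients are summed,
    Gaussian noise with standard deviation noise_multiplier * max_norm is
    added to every coordinate, and the noisy sum is averaged over the batch.
    """
    dim = len(weights)
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_sample_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.0, 0.5]]
    print(dp_sgd_step([0.0, 0.0], grads, lr=0.1, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

Because the noise is added to every coordinate of the summed gradient, all features of a sample pay the privacy cost, which is the utility loss the abstract attributes to applying DP-SGD uniformly.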

artificial intelligence, machine learning, sensitive feature, (17 more...)

arXiv:2511.03806

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.90)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling

Krchova, Ivona, Platzer, Michael, Tiwald, Paul

arXiv.org Artificial Intelligence, Jul-23-2025

Unbalanced tabular data sets present significant challenges for predictive modeling and data analysis across a wide range of applications. In many real-world scenarios, such as fraud detection, medical diagnosis, and rare event prediction, minority classes are vastly underrepresented, making it difficult for traditional machine learning algorithms to achieve high accuracy. These algorithms tend to favor the majority class, leading to biased models that struggle to accurately represent minority classes. Synthetic data holds promise for addressing the under-representation of minority classes by providing new, diverse, and highly realistic samples. This paper presents a benchmark study on the use of AI-generated synthetic data for upsampling highly unbalanced tabular data sets. We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a flexible and user-friendly approach to synthetic upsampling for mixed-type data. We compare predictive models trained on data sets upsampled with synthetic records to those using standard methods, such as naive oversampling and SMOTE-NC. Our results demonstrate that synthetic data can improve predictive accuracy for minority groups by generating diverse data points that fill gaps in sparse regions of the feature space. We show that upsampled synthetic training data consistently results in top-performing predictive models, particularly for mixed-type data sets containing very few minority samples.
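The naive oversampling baseline mentioned in the abstract can be sketched in a few lines. This is an illustrative assumption about how such a baseline typically works (duplicating minority rows at random until every class matches the largest one); it does not reproduce the paper's experimental setup or the Synthetic Data SDK. The function name and the row/label representation are hypothetical.

```python
import random
from collections import Counter


def naive_oversample(rows, labels, rng):
    """Duplicate randomly chosen rows of each smaller class until all
    classes have as many rows as the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
    labels = [0, 0, 0, 1]
    new_rows, new_labels = naive_oversample(rows, labels, random.Random(0))
    print(Counter(new_labels))
```

Since the added rows are exact copies, this baseline adds no new points in sparse regions of the feature space, which is the gap the abstract says synthetic upsampling is meant to fill.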

artificial intelligence, machine learning, minority class, (19 more...)

arXiv:2507.16419

Genre: Research Report > New Finding (1.00)

Industry: Banking & Finance (0.71)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Enhancing Metabolic Syndrome Prediction with Hybrid Data Balancing and Counterfactuals

Shah, Sanyam Paresh, Mamun, Abdullah, Soumma, Shovito Barua, Ghasemzadeh, Hassan

arXiv.org Artificial Intelligence, May-8-2025

Metabolic Syndrome (MetS) is a cluster of interrelated risk factors that significantly increases the risk of cardiovascular diseases and type 2 diabetes. Despite its global prevalence, accurate prediction of MetS remains challenging due to issues such as class imbalance, data scarcity, and methodological inconsistencies in existing studies. In this paper, we address these challenges by systematically evaluating and optimizing machine learning (ML) models for MetS prediction, leveraging advanced data balancing techniques and counterfactual analysis. Multiple ML models, including XGBoost, Random Forest, TabNet, etc., were trained and compared under various data balancing techniques such as random oversampling (ROS), SMOTE, ADASYN, and CTGAN. Additionally, we introduce MetaBoost, a novel hybrid framework that integrates SMOTE, ADASYN, and CTGAN, optimizing synthetic data generation through weighted averaging and iterative weight tuning to enhance the model's performance (achieving up to a 1.87% accuracy improvement over individual balancing techniques). A comprehensive counterfactual analysis is conducted to quantify the feature-level changes required to shift individuals from high-risk to low-risk categories. The results indicate that blood glucose (50.3%) and triglycerides (46.7%) were the most frequently modified features, highlighting their clinical significance in MetS risk reduction. Additionally, probabilistic analysis shows elevated blood glucose (85.5% likelihood) and triglycerides (74.9% posterior probability) as the strongest predictors. This study not only advances the methodological rigor of MetS prediction but also provides actionable insights for clinicians and researchers, highlighting the potential of ML in mitigating the public health burden of metabolic syndrome.
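As background for the balancing techniques listed in the abstract, the core interpolation step of basic SMOTE can be sketched as below. This is a simplified illustration for purely numeric features, assuming Euclidean distance; it is not MetaBoost, and the weighted combination of SMOTE, ADASYN, and CTGAN described in the abstract is not shown because its details are not given here. The function name and parameters are hypothetical.

```python
import math
import random


def smote_samples(minority, k, n_new, rng):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority points, ordered by distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    for point in smote_samples(minority, k=2, n_new=3, rng=random.Random(42)):
        print(point)
```

Each synthetic point lies on a line segment between two existing minority points, so the method needs at least two minority samples.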

artificial intelligence, machine learning, metabolic syndrome, (17 more...)

arXiv:2504.06987

Country: North America > United States > Arizona (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Effective and Efficient Federated Tree Learning on Hybrid Data

Li, Qinbin, Xie, Chulin, Xu, Xiaojun, Liu, Xiaoyuan, Zhang, Ce, Li, Bo, He, Bingsheng, Song, Dawn

arXiv.org Artificial Intelligence, Oct-18-2023

Federated learning has emerged as a promising distributed learning paradigm that facilitates collaborative learning among multiple parties without transferring raw data. However, most existing federated learning studies focus on either horizontal or vertical data settings, where the data of different parties are assumed to be from the same feature or sample space. In practice, a common scenario is the hybrid data setting, where data from different parties may differ both in the features and samples. To address this, we propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data. We observe the existence of consistent split rules in trees. With the help of these split rules, we theoretically show that the knowledge of parties can be incorporated into the lower layers of a tree. Based on our theoretical analysis, we propose a layer-level solution that does not need frequent communication traffic to train a tree. Our experiments demonstrate that HybridTree can achieve comparable accuracy to the centralized setting with low computational and communication overhead. HybridTree can achieve up to 8 times speedup compared with the other baselines.

efficient federated tree learning, hybrid data

arXiv:2310.11865

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.60)

Hybrid data driven/thermal simulation model for comfort assessment

Barbedienne, Romain, Ouerk, Sara Yasmine, Yagoubi, Mouadh, Bouia, Hassan, Kaemmerlen, Aurelie, Charrier, Benoit

arXiv.org Artificial Intelligence, Sep-4-2023

Machine learning models improve the speed and quality of physical models. However, they require a large amount of data, which is often difficult and costly to acquire. Predicting thermal comfort, for example, requires a controlled environment, with participants presenting various characteristics (age, gender, ...). This paper proposes a method for hybridizing real data with simulated data for thermal comfort prediction. The simulations are performed using the Modelica language. A benchmarking study is conducted to compare different machine learning methods. The results look promising, with an F1 score of 0.999 obtained using the random forest model.

artificial intelligence, machine learning, thermal simulation model, (2 more...)

arXiv:2309.01734

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas > Upstream (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance

Zhu, Fengbin, Lei, Wenqiang, Huang, Youcheng, Wang, Chao, Zhang, Shuo, Lv, Jiancheng, Feng, Fuli, Chua, Tat-Seng

arXiv.org Artificial Intelligence, Jun-1-2021

Hybrid data combining both tabular and textual content (e.g., financial reports) are quite pervasive in the real world. However, Question Answering (QA) over such hybrid data is largely neglected in existing research. In this work, we extract samples from real financial reports to build a new large-scale QA dataset containing both Tabular And Textual data, named TAT-QA, where numerical reasoning is usually required to infer the answer, such as addition, subtraction, multiplication, division, counting, comparison/sorting, and their compositions. We further propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text. It adopts sequence tagging to extract relevant cells from the table along with relevant spans from the text to infer their semantics, and then applies symbolic reasoning over them with a set of aggregation operators to arrive at the final answer. TAGOP achieves 58.0% in F1, which is an 11.1% absolute increase over the previous best baseline model, according to our experiments on TAT-QA. But this result still lags far behind the performance of expert humans, i.e., 90.8% in F1. It is demonstrated that our TAT-QA is very challenging and can serve as a benchmark for training and testing powerful QA models that address hybrid form data.
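The idea of applying aggregation operators to values extracted from a table and its accompanying text can be illustrated with a small sketch. This is not the TAGOP model: the tagging step is assumed to have already produced the numeric evidence, and the operator names and the `answer` helper are hypothetical, chosen to mirror the reasoning types named in the abstract.

```python
import math

# Hypothetical operator set covering the reasoning types listed in the
# abstract. Each operator maps a list of extracted numbers to one result.
OPERATORS = {
    "sum": lambda xs: sum(xs),
    "difference": lambda xs: xs[0] - xs[1],
    "multiplication": lambda xs: math.prod(xs),
    "division": lambda xs: xs[0] / xs[1],
    "count": lambda xs: len(xs),
    "max": lambda xs: max(xs),
}


def answer(operator, values):
    """Apply the chosen aggregation operator to the extracted values."""
    return OPERATORS[operator](values)


if __name__ == "__main__":
    # Illustrative values, e.g. two revenue cells tagged in a table.
    tagged_cells = [120.5, 100.0]
    print(answer("difference", tagged_cells))
    print(answer("count", tagged_cells))
```

In a full system, a model would have to predict both which cells and spans are relevant and which operator to apply; here both choices are supplied by hand.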

paragraph, reasoning, tat-qa, (17 more...)

arXiv:2105.07624

Country:

Asia > Singapore (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
North America > United States > Washington > King County > Seattle (0.04)

Genre: Research Report (0.50)

Industry:

Banking & Finance (0.68)
Information Technology > Services (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.61)
